I. Introduction

This report presents the results of preliminary experiments that we performed on
the Leukemia data of Golub et al. Our goal is to devise a method for selecting the
best SVM for the task, using only training data. This involves various "model
selection" criteria, including the leave-one-out error rate and the size of the
margin rescaled by the largest distance between patterns. We reserved the test
set for the "final test" and have not touched it yet.

We also present a parallel algorithm for SVM that was invented by Ross
Baldick.

During this report period, we also carried out various explorations on the Brown
et al data, compared SVM training algorithms, and wrote a white paper on SVM
applications. These other tasks are or will be described in separate documents.

II. Paper review and description of the tasks

In their paper, the authors present methods for analysing gene expression data
obtained from DNA micro-arrays in order to classify types of cancer.

Data set

Their method is illustrated on Leukemia data. The problem is the distinction
between two variants of Leukemia (ALL and AML).

Their training set consists of 38 samples (27 ALL and 11 AML). Their test set
has 34 samples (20 ALL and 14 AML) collected under different experimental
conditions. All samples have 7129 attributes (or features) corresponding to some
normalized gene expression value extracted from the micro-array image.

Tasks:

The authors investigate two tasks: 

- Class prediction (supervised learning): after training a classifier on the
training set, including the ALL/AML labeling information, they try to predict the
ALL/AML class labels on the test set.

- Class discovery (unsupervised learning): They remove the class labels and
use a clustering technique to find whether the distinction ALL/AML can be
discovered automatically.

In this report we concentrate on the first task: class prediction.

The authors also address an important sub-problem: that of attribute (or feature)
selection, in this case gene selection. They devise a method that selects 50
genes out of 7129.

Algorithms:

Gene selection:

To reduce the dimensionality of input space, the authors use the following
technique: find the features (genes) that resemble most the target vector (or its
opposite), using the following metric:

P = (m_1 - m_2) / (s_1 + s_2)
where m_1 and s_1 are the mean and standard deviation values of the given feature
on class 1 examples (e.g. ALL examples). Similarly m_2 and s_2 are the mean and
standard deviation values of the given feature on class 2 examples (e.g. AML
examples).
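
As an illustration only, the metric above can be sketched in Python. The function names, the use of the population standard deviation (`pstdev`) and the ranking by absolute value of P (to capture features resembling the target "or its opposite") are assumptions of this sketch, not details confirmed by the paper.

```python
from statistics import mean, pstdev

def golub_score(values, labels):
    """P = (m_1 - m_2) / (s_1 + s_2) for one feature; labels are 1 or 2."""
    c1 = [v for v, y in zip(values, labels) if y == 1]
    c2 = [v for v, y in zip(values, labels) if y == 2]
    # Assumption: population standard deviation; s_1 + s_2 must be non-zero.
    return (mean(c1) - mean(c2)) / (pstdev(c1) + pstdev(c2))

def select_features(X, labels, n_keep):
    """Return (indices of the n_keep features with largest |P|, all scores)."""
    n_features = len(X[0])
    scores = [golub_score([row[j] for row in X], labels)
              for j in range(n_features)]
    # Assumption: rank by |P| so that anti-correlated features also qualify.
    ranked = sorted(range(n_features), key=lambda j: abs(scores[j]), reverse=True)
    return ranked[:n_keep], scores
```

Here `X` is a list of samples, each a list of feature values, and `labels` holds 1 or 2 for each sample.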

They reduce the space from 7129 to 50 genes. In a more detailed technical
memorandum, they mention that a number of genes from 3 to 200, selected with
this method, gives similar results. This suggested to us that perhaps only 2 genes
selected in a better way would suffice.

Class prediction:



The authors use a linear classifier, with the following decision function:
D(x) = w.(x-b)

where x is an input vector (the gene expression of a patient) and w and b are a
weight and a bias vector computed from the training data as follows:

w_i = (m_1 - m_2) / (s_1 + s_2)
where m_1 and s_1 are the mean and standard deviation values of the feature
(gene expression) number i on class 1 examples. Similarly m_2 and s_2 for class 2.

b_i = (m_1 + m_2) / 2
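
A minimal sketch of this classifier, written directly from the three formulas above, might look as follows. The function names and the choice of the population standard deviation are assumptions of the sketch. Note that with these formulas D(x) is positive on the class 1 side.

```python
from statistics import mean, pstdev

def train_linear(X, labels):
    """Per-feature w_i = (m_1 - m_2)/(s_1 + s_2) and b_i = (m_1 + m_2)/2."""
    w, b = [], []
    for j in range(len(X[0])):
        c1 = [row[j] for row, y in zip(X, labels) if y == 1]
        c2 = [row[j] for row, y in zip(X, labels) if y == 2]
        m1, m2 = mean(c1), mean(c2)
        # Assumption: population standard deviation, with s_1 + s_2 non-zero.
        w.append((m1 - m2) / (pstdev(c1) + pstdev(c2)))
        b.append((m1 + m2) / 2)
    return w, b

def decision(x, w, b):
    """D(x) = w.(x - b), summed over all features."""
    return sum(wi * (xi - bi) for wi, xi, bi in zip(w, x, b))
```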

This classification method bears similarity with Bayesian classifiers assuming
Normal data distribution, as explained in a detailed TM available from the
authors' web site.

For class discovery, the authors use Self Organizing feature Maps (SOM), a well
known neural technique invented by Kohonen.

Normalization:

All features are normalized by subtracting the mean feature value on the training
examples and dividing by the standard deviation, also computed on the training
examples.



Methodology: 

Leave-one-out

To select between algorithm variants, the leave-one-out method is used: one
example of the training set is taken out. Training is performed on the remaining
examples. The left out example is used to test. The procedure is iterated over all
examples.

Feature selection/normalization

- Feature (gene) selection, training and parameter tuning are performed on the
training set only.

- The features selected using the knowledge of the class labels cannot be used
for unsupervised learning.

- Feature selection is performed on all training examples, which slightly biases
the leave-one-out prediction.

- Normalization is performed using the mean and standard deviation computed
on the training examples only. When applied to the leave-one-out, these
quantities are recomputed for every left out pattern.
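
The leave-one-out procedure with per-fold normalization can be sketched as below. This is an illustration under stated assumptions: class labels are coded -1/+1, `train` and `decide` are placeholders for any classifier, and `train_midpoint`/`decide_midpoint` are a hypothetical stand-in classifier used only to make the example runnable.

```python
from statistics import mean, pstdev

def standardize(train_X, x):
    """Scale features with mean/std computed on the training fold only."""
    n = len(train_X[0])
    mu = [mean([r[j] for r in train_X]) for j in range(n)]
    # A constant feature has std 0; fall back to 1.0 to avoid dividing by zero.
    sd = [pstdev([r[j] for r in train_X]) or 1.0 for j in range(n)]
    def scale(r):
        return [(r[j] - mu[j]) / sd[j] for j in range(n)]
    return [scale(r) for r in train_X], scale(x)

def leave_one_out_scores(X, y, train, decide):
    """Return the held-out decision value D(x_i) for every example i."""
    scores = []
    for i in range(len(X)):
        fold_X = X[:i] + X[i + 1:]
        fold_y = y[:i] + y[i + 1:]
        fold_Xn, x_n = standardize(fold_X, X[i])  # recomputed for each fold
        model = train(fold_Xn, fold_y)
        scores.append(decide(model, x_n))
    return scores

# Hypothetical stand-in classifier: threshold halfway between the class means.
def train_midpoint(X, y):
    m_neg = mean([r[0] for r, t in zip(X, y) if t == -1])
    m_pos = mean([r[0] for r, t in zip(X, y) if t == +1])
    return (m_neg + m_pos) / 2

def decide_midpoint(model, x):
    return x[0] - model

X = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
y = [-1, -1, -1, 1, 1, 1]
scores = leave_one_out_scores(X, y, train_midpoint, decide_midpoint)
loo_errors = sum(1 for d, t in zip(scores, y) if d * t <= 0)
```

Feature selection is deliberately left outside the loop here, which mirrors the bias noted in the list above.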

Rejection

In some cases, it is better to refuse to make a decision than to make a
wrong decision (e.g. one that could potentially lead to the wrong treatment). The
classifier score is used to assess classification confidence. Below a certain
confidence value, the example is "rejected", that is, no classification decision is
taken.
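
A minimal sketch of rejection, assuming that the confidence is taken to be the absolute value of the classifier score and that `min_confidence` is a hypothetical user-chosen cut-off (neither is specified in the text above):

```python
def classify_with_rejection(score, min_confidence):
    """Return +1 or -1 (the sign of the score), or None if the example is rejected."""
    # Assumption: confidence is measured by the magnitude of the score.
    if abs(score) < min_confidence:
        return None  # rejected: no classification decision is taken
    return 1 if score > 0 else -1
```

For example, with `min_confidence=0.1` a score of 0.05 is rejected, while scores of 0.5 and -0.5 are assigned to the positive and negative sides respectively.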

III. Methodology improvements

Our exploratory experiments indicated that the problem is simple and that there
are very few classification errors. In particular, it is possible to find classifiers with
zero leave-one-out error. In order to be able to compare classifiers, we plot false
negative/false positive curves as described below:

Red: Number of examples of class 1 whose decision function value is smaller than or equal to θ.
Blue: Number of examples of class 2 whose decision function value is larger than or equal to θ.

The problem we are interested in is a two-class problem. The class labels are
(-1) for class 1, the "negative class" and (+1) for class 2, the "positive class".
The classifiers we are interested in (Golub et al or SVM) make their decision
according to the value of a decision function D(x) of an input vector x, e.g.:

- If D(x)<θ, classify x in class 1

- If D(x)>θ, classify x in class 2

D(x) already incorporates a bias, which is a parameter optimized by training on
the training set. Threshold θ is another parameter, which is determined by
cross-validation. It is used in some applications for which classifying an example of
class 1 into class 2 (type I errors) is more severe than the opposite (type II
errors). By adjusting θ, one can monitor the ratio of one type of error over the
other.

One way of reading the fn/fp curves is to think of the red curve as the number of
false positives (type I errors) as a function of θ and the blue curve as the number
of false negatives (type II errors) as a function of θ. By choosing θ one can monitor
the tradeoff between type I and type II errors.
The fn/fp curves are obtained by calculating D(x) by cross-validation using the
leave-one-out method. After θ and other parameters are adjusted, new fn/fp
curves can be obtained from the test data.
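
As a sketch, the two error counts can be computed from held-out decision values by applying the decision rule stated earlier for each candidate threshold. Counting ties (D(x) equal to θ) as errors on both sides is an assumption of this sketch, as are the function and argument names.

```python
def fn_fp_curves(scores, labels, thetas):
    """Return (fn, fp): error counts for each threshold in thetas.

    scores: decision values D(x); labels: -1 (class 1) or +1 (class 2).
    """
    # Type I error: a class 1 example not classified in class 1 (D(x) >= theta).
    fp = [sum(1 for d, y in zip(scores, labels) if y == -1 and d >= t)
          for t in thetas]
    # Type II error: a class 2 example not classified in class 2 (D(x) <= theta).
    fn = [sum(1 for d, y in zip(scores, labels) if y == +1 and d <= t)
          for t in thetas]
    return fn, fp
```

Passing leave-one-out decision values as `scores` gives the cross-validated curves; passing test-set values gives the test curves.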

Classifier quality

In this application, we fix θ=0. We use the fn/fp curves for the purpose of
evaluating classifier quality.

The red curve is the number of examples of class 1 whose decision function
value is smaller than or equal to θ. The blue curve is the number of examples of
class 2 whose decision function value is larger than or equal to θ. We derive from
these curves 3 parameters that characterize classifier quality:

- The total number of classification errors (sum of type I and type II errors at
θ=0). The fewer errors, the better the classifier.

- The extremal margin (the re-scaled difference between min(D(x)) for
class 2 examples and max(D(x)) for class 1 examples). It indicates the
performance on the worst classified patterns. The extremal margin is negative
if there are classification errors. If there are no classification errors, it is
positive. In some cases it may be positive and yet there are some remaining
errors (the decision threshold is not in the margin area). The larger the
extremal margin is, the better the classifier.

- The median margin (the re-scaled difference between median(D(x)) for
class 2 examples and median(D(x)) for class 1 examples). The median
margin is usually positive. It indicates how well the two classes are separated
on average. The larger the median margin is, the better the classifier.
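
The three quality parameters can be sketched as follows. The text does not say how the margins are re-scaled; dividing by the range of D(x) over all examples is purely an assumption of this sketch, as is counting D(x)=0 as an error.

```python
from statistics import median

def classifier_quality(scores, labels):
    """Return (errors at theta=0, extremal margin, median margin).

    scores: decision values D(x); labels: -1 (class 1) or +1 (class 2).
    """
    neg = [d for d, y in zip(scores, labels) if y == -1]
    pos = [d for d, y in zip(scores, labels) if y == +1]
    # ASSUMPTION: re-scale by the spread of D(x); the report gives no formula.
    scale = max(scores) - min(scores)
    errors = sum(d >= 0 for d in neg) + sum(d <= 0 for d in pos)
    extremal = (min(pos) - max(neg)) / scale
    med = (median(pos) - median(neg)) / scale
    return errors, extremal, med
```

With decision values [-2, -1, 1, 3] and labels [-1, -1, +1, +1], this gives no errors, an extremal margin of 0.4 and a median margin of 0.7.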

A zone in which no classification decision is made can be defined on either
side of the decision threshold (e.g. light yellow region). Within this zone,
classification decisions are considered uncertain.

IV. Reproducing the baseline results

We reproduced the baseline results of the Golub et al paper. We extracted from
their paper the 50 informative genes that they selected:

Golub et al 50 most informative genes
for 38 training examples (27 ALL top, 11 AML bottom).

We implemented their classification technique and obtained the following fn/fp 
curves by cross-validation (leave-one-out): 

fn/fp curves (extremal margin=0.035173, median margin=0.41092)
Golub et al best results on their selection of 50 genes.



We confirm that with the leave-one-out method they have zero error. Their
extremal margin is small: the examples that are hardest to classify are not
classified with high confidence.

fn/fp curves (extremal margin=-0.25398, median margin=0.28075)
Golub et al without gene selection



We also tried their classification technique on the whole set of 7129 genes (no
gene selection). This test reveals that the classification method of Golub et al
performs poorly without first reducing the dimensionality of input space. The
leave-one-out error rate is 4/38. The extremal margin is very negative.



Results

Exploratory experiments indicated that the training set is linearly separable.
We tried linear SVM, polynomial SVM and radial SVM. Linear SVMs outperform
the other methods. We concentrated our efforts on linear SVM and on the
improvement of the feature (gene) selection method.



We trained an SVM classifier on their 50 features (genes). In this experiment
(figure on next page), it is arguable which classifier is best: the SVM has 1 error and
a small negative extremal margin, but the SVM median margin is larger than that of
the classifier of Golub et al trained on the same features.



SVMs can also be trained without feature (gene) selection. The linear SVM
obtained (figure on next page) has only 2 errors and a small positive extremal
margin. Its median margin is much larger than that of Golub et al on all genes.
Overall, it is clearly a superior classifier to the baseline classifier trained in
the same conditions.





Linear SVM without gene selection 



SVM-based feature selection



The feature selection method of Golub et al is designed to work best with their
classification technique. Moreover, it is rather crude. We designed and
implemented an SVM-based feature selection technique which proved to be
superior and allowed us to narrow down the number of genes to only two.

Feature selection algorithm description

Method 1 (combinatorial and slow):



SVMs are trained using subsets of input features of the same size. The resulting
classifiers are ranked in order of classifier quality (as measured, for instance, by
the ratio of margin size over the largest distance between patterns). The subset
of features yielding the best classifier is selected.
This method is not practical if the number of features is large and/or the size of
the subsets is large, because of computational considerations. Therefore, we
complement it with a second method, weaker but faster.
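
The exhaustive search of method 1 can be sketched as below. The `quality` argument is a placeholder: in practice it would train an SVM on the given subset and return its quality measure; here it is left as a caller-supplied function, which is an assumption of the sketch.

```python
from itertools import combinations
from math import comb

def best_subset(n_features, subset_size, quality):
    """Score every feature subset of the given size and return the best one.

    quality: callable taking a tuple of feature indices and returning a
    number (higher is better), e.g. margin over largest pattern distance.
    """
    return max(combinations(range(n_features), subset_size), key=quality)

def n_subsets(n_features, subset_size):
    """Number of classifiers that the exhaustive search has to train."""
    return comb(n_features, subset_size)
```

The cost is what makes the method impractical on the full data: `n_subsets(7129, 2)` is 25,407,756, so even pairs of genes would require training over 25 million classifiers.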

Method 2 (sub-optimal but fast): 

A first SVM classifier is trained on all the features. The features are then ranked
in order of increasing weight:

w_i = sum_k α_k y_k x_{k,i}
where the sum runs over the support vectors x_k of class polarity y_k (+1 or -1) and
Lagrange multiplier α_k. For linear SVMs, these weights are the weights of the
linear classifier itself, in feature space. For non-linear SVMs, these weights are
proportional to an average of all the weights that involve input feature i.

The input features corresponding to the largest weights are kept. This method is
justified by the fact that removing the features with smallest weight least
perturbs the solution.

After feature pruning, the classifier needs to be retrained. The procedure may be
iterated.
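
One pruning step of method 2 can be sketched as follows, for the linear case. The support vectors, labels and Lagrange multipliers are assumed to come from an already trained SVM (no trainer is included), and ranking by the absolute value of the weight is an assumption of this sketch.

```python
def feature_weights(support_vectors, y, alpha):
    """w_i = sum over support vectors k of alpha_k * y_k * x_k[i]."""
    n = len(support_vectors[0])
    return [sum(a * yk * xk[i] for a, yk, xk in zip(alpha, y, support_vectors))
            for i in range(n)]

def prune(weights, n_keep):
    """Return the sorted indices of the n_keep largest-magnitude weights."""
    # Assumption: "largest weights" is read as largest in absolute value.
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]),
                    reverse=True)
    return sorted(ranked[:n_keep])
```

After calling `prune`, the classifier would be retrained on the kept features before any further pruning step.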

Combined method:

Method 2 is used to downsize feature space as much as possible before using
method 1.

In one of our implementations, we first removed enough features to reach a
number of input features that is a power of 2. We then iterated the feature
pruning method, dividing the number of features by two at each step. For each
classifier that we trained, we measured classifier quality with the ratio of margin
size over the largest distance between patterns. Quality as a function of the number
of input features was plotted. This curve exhibits a maximum. We selected the
feature subset of maximum quality. On that set, we used method 1 to reach a
subset of only 2 features.
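
The sequence of feature-set sizes visited by this halving scheme can be sketched as below. Starting from the largest power of 2 that does not exceed the number of features is an assumption; the report only says that a power of 2 was reached.

```python
def pruning_schedule(n_features):
    """Feature counts visited: a power of 2, then halved down to 1."""
    if n_features < 1:
        raise ValueError("n_features must be at least 1")
    # Assumption: start at the largest power of 2 not exceeding n_features.
    size = 1 << (n_features.bit_length() - 1)
    sizes = []
    while size >= 1:
        sizes.append(size)
        size //= 2
    return sizes
```

For the 7129 genes of this data set, the schedule starts at 4096 and contains 13 sizes, so only 13 classifiers need to be trained and scored to draw the quality curve.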

fn/fp curves (extremal margin=0.097231, median margin=0.51626)
SVM trained on 50 genes selected with SVM pruning method 2.

fn/fp curves (extremal margin=0.16458, median margin=0.43559)
SVM trained only on two genes (M23197 and M81933) selected with the combined method.

In order to compare with the baseline recognizer, we used feature pruning
method 2 to downsize feature space to 50 features (figure below). It is interesting
to notice that these features do not look as orderly as the features selected by
Golub et al's method and yet perform better. This is not so surprising: all of Golub
et al's features are highly correlated and therefore carry less information than ours.
The SVM classifier that we trained on these features is better than the baseline
classifier in all respects: larger extremal and median margins (figure on previous
page).




50 features selected with the SVM combined method 

We used the combined method to select only two genes to perform the
separation (figure on previous page). We also obtained perfect separation by
cross-validation. The extremal and median margins remain comfortable. The
baseline method of Golub et al could never achieve such a result.

The two genes selected are:

M81933-at : CDC25A Cell division cycle 25A

M23197-at : CD33 CD33 antigen (differentiation antigen)




Gene M81933 and M23197 



In two dimensions, it is possible to visualize the solution obtained. 



SVM two-dimensional separation with genes M81933 and M23197



V. Parallel algorithm for SVM

We briefly describe the algorithm of Ross Baldick et al for parallel implementation
of SVMs:



Initialization: Divide the training set into N subsets of equivalent size. Each
processor is assigned a subset (initial working set).
Step 1: Each processor trains an SVM with its working set.
Step 2: The support vectors found are broadcast and all the support vectors
found by all processors are added to the working sets of each processor.
Steps 1 and 2 are iterated until convergence.
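
The steps above can be illustrated with a small sequential simulation. Everything specific here is an assumption made to keep the sketch runnable: the "processors" are simulated in a loop, the stopping test is that no working set changes, and `train_1d_svm` is a toy hard-margin trainer for separable one-dimensional data (negatives to the left of positives), not a general SVM.

```python
def train_1d_svm(points):
    """Toy trainer: return the support vectors of a 1-D separable problem.

    points: iterable of (value, label) with label -1 or +1. The support
    vectors are the rightmost negative and the leftmost positive point.
    """
    neg = [p for p in points if p[1] == -1]
    pos = [p for p in points if p[1] == +1]
    if not neg or not pos:
        return set(points)  # one class missing: keep every point as candidate
    return {max(neg), min(pos)}

def parallel_svm(data, n_workers, train):
    """Simulate the scheme; return the support vectors found at convergence."""
    # Initialization: split the training set into subsets of similar size.
    working = [set(data[i::n_workers]) for i in range(n_workers)]
    while True:
        # Step 1: each (simulated) processor trains on its working set.
        found = [train(sorted(ws)) for ws in working]
        # Step 2: all support vectors are shared with every processor.
        shared = set().union(*found)
        new_working = [ws | shared for ws in working]
        if new_working == working:  # nothing new was added: converged
            return shared
        working = new_working
```

In this simulation the working sets can only grow and are bounded by the training set, so the loop always terminates. On the toy data `[(-5, -1), (-3, -1), (-1, -1), (1, 1), (2, 1), (6, 1)]` with two workers, the result matches the support vectors obtained by training on the full set.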

Properties: 

- Convergence is guaranteed.

- The algorithm was invented for classification, but works also for regression. 

VI. Further work

Next, we will refine our feature selection and model selection methods and clean
up our code. While we already outperform Golub et al when comparing cross-validation
results, we still want to build more confidence in our classifier to be sure that we
will outperform them on the test set too. This process will include refining our
classifier quality measurements.

We will also solve the class discovery problem (unsupervised learning), using 
SVM. 